## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.4 0.70 0.00 1.9 0.076
## 2 7.8 0.88 0.00 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.70 0.00 1.9 0.076
## 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
I discuss the stats further below.
From this first chart, you can clearly see most data lies in the 5-6 range for quality, which could make seeing trends difficult, especially since quality only takes integer values.
Alcohol distribution is slightly skewed right, with a median of 10.20% abv and a mean of 10.42%.
Residual sugar is very skewed. This variable may benefit from a log transformation. There are some outliers on the high end as the max is 15.5 but only 25% of the wines are above 2.6.
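As a rough check on the skew, the quartiles printed in the summary above can be run through the conventional boxplot rule. This is only a sketch: the 1.5 × IQR multiplier is the usual convention rather than something this analysis relied on, and the variable names are made up for illustration.

```r
# Quartiles and maximum of residual.sugar, copied from the summary() output above
q1 <- 1.9
q3 <- 2.6
sugar_max <- 15.5

# Conventional boxplot rule (an assumption here): values beyond Q3 + 1.5 * IQR are flagged
upper_fence <- q3 + 1.5 * (q3 - q1)

# A log10 transform pulls the long right tail in
log_stats <- log10(c(q1 = q1, q3 = q3, max = sugar_max))

print(upper_fence)
print(round(log_stats, 2))
```

The fence works out to 3.65 g/dm^3, far below the maximum of 15.5, which is consistent with the long right tail described above.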
The above features are all related to acid levels so I grouped them together. Fixed and volatile acidity are fairly normal with some right skew. Volatile acidity is an order of magnitude smaller. Citric acid seems to have a lot of zero valued points. pH is normally distributed with a range of 2.74 to 4.01.
table(wines$citric.acid)
##
## 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14
## 132 33 50 30 29 20 24 22 33 30 35 15 27 18 21
## 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29
## 19 9 16 22 21 25 33 27 25 51 27 38 20 19 21
## 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44
## 30 30 32 25 24 13 20 19 14 28 29 16 29 15 23
## 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59
## 22 19 18 23 68 20 13 17 14 13 12 8 9 9 8
## 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74
## 9 2 1 10 9 7 14 2 11 4 2 1 1 3 4
## 0.75 0.76 0.78 0.79 1
## 1 3 1 1 1
There are in fact 132 wines with a citric acid value of 0, a little over 8% of the data (132 of 1599).
Sulfur dioxide (\(SO_2\)), whether free or total, is right skewed. There is an outlier with pretty high total \(SO_2\) content (almost 300), while 75% of the data is below 62. \(SO_2\) helps prevent microbial growth, but a value above 50 (for the free form) can be detectable in the wine, which is generally undesirable.
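The share of zero-valued wines follows directly from the counts in the table above. A minimal sketch, with illustrative variable names:

```r
# Counts copied from the table() output and the dataset size reported by str()
zero_citric <- 132
n_wines <- 1599

share_zero <- zero_citric / n_wines
print(round(100 * share_zero, 1))  # percentage of wines with citric.acid == 0

# With the full data frame loaded, the same figure would be mean(wines$citric.acid == 0)
```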
table(wines$free.sulfur.dioxide>50)
##
## FALSE TRUE
## 1583 16
There are only 16 wines with this high level of free \(SO_2\).
Sulphates similarly act to help preserve wine. The data is fairly normal but there are some outliers on the high side. These might be the same wines that had high free \(SO_2\) content, so I checked for wines with both free \(SO_2\) above 50 and sulphates above 1.
table(wines$free.sulfur.dioxide>50 & wines$sulphates>1)
##
## FALSE
## 1599
This does not seem to be the case.
Chlorides seemed right skewed, but a finer binwidth and zooming in show that the distribution is close to normal around a median of 0.079, with a lot of outliers higher than ~0.15.
Density is very close to normally distributed with a very small variance.
sd(wines$density)
## [1] 0.001887334
The standard deviation is only ~0.002.
All of the variables are floats with the exception of quality which is an integer. There are no missing values. We have 1599 wines in the dataset, each with 12 features.
The quality variable is probably the most interesting in terms of seeing how it might relate to other features or developing a predictive model.
Any of the other features might have a trend with quality, but I’d guess some of the acid measurements and the sulfur metrics might be important for quality, while something like density is probably not as important since it has such a small range.
I did not create any new features. I may transform some of the skewed variables with log or sqrt transforms.
Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
No operations were needed to tidy the data; all observations were complete. There were some unusual distributions and outliers. The outliers seem to be true data values rather than input errors, though, so I did not want to get rid of any of these values.
Several of the variables seem to be related to acidity - fixed.acidity, volatile.acidity, citric.acid, and pH. pH is measured on the typical scale (lower pH is more acidic), while the others are measured in \(g/dm^3\). We can see that fixed acidity dominates in terms of the relative magnitude, with a mean of 8.32, while volatile acidity has a mean of 0.53 and citric acid a mean of 0.27.
Residual sugar has a fairly large range, with a min of 0.9 and a max of 15.5 (in \(g/dm^3\)), but 75% of the values are below 2.6, so there is some skew. Chlorides and sulphates are similarly skewed. Chlorides are about two orders of magnitude smaller than fixed acidity, and the sulfur dioxide values are not directly comparable because the units are different for \(SO_2\). Density has barely any variability (the full range is ~1% of the mean).
The alcohol content (abv) varies from a minimum of 8.4 to a maximum of 14.9 with the median value at 10.20. This seems a little low for red wines, but I suppose this particular type (Vinho Verde) may tend to be lower alcohol.
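The claim about density can be backed with two quick ratios built from numbers already printed above: the full range relative to the mean, and the coefficient of variation. A small sketch, with variable names chosen only for illustration:

```r
# Values copied from the summary() and sd() output above
dens_min  <- 0.9901
dens_max  <- 1.0037
dens_mean <- 0.9967
dens_sd   <- 0.001887334

rel_range <- (dens_max - dens_min) / dens_mean  # full range relative to the mean
coef_var  <- dens_sd / dens_mean                # coefficient of variation

print(round(100 * rel_range, 2))  # percent
print(round(100 * coef_var, 2))   # percent
```

The range comes to about 1.4% of the mean and the standard deviation to about 0.2% of the mean, so density really does vary very little across these wines.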
Finally, quality ratings are on a 1-10 scale, but we only have measurements in the 3-8 range and at least 50% of the wines were rated 5 or 6. The mean rating is 5.64.
The pairs plot shows the distributions as seen above, some simple scatterplots and also the correlations between variables. None have a very high correlation with quality, but alcohol has the highest at ~0.48.
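The correlations with quality can also be pulled out of the correlation matrix and ranked directly. The sketch below uses only the first ten rows printed by `str()` above, so `wines_sample` is an illustrative stand-in and its coefficients will not match the full-data values such as the ~0.48 for alcohol.

```r
# First ten rows as printed by str() above; far too few to reproduce the real coefficients
wines_sample <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Pearson correlation of every feature with quality
cors <- cor(wines_sample)[, "quality"]
cors <- cors[names(cors) != "quality"]

# Rank by absolute strength, keeping the sign
cors_sorted <- cors[order(abs(cors), decreasing = TRUE)]
print(round(cors_sorted, 2))
```

Running the same lines on the full data frame instead of `wines_sample` would rank all eleven features at once.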
I made a few more histograms below to examine some of the variables closer and see some initial binary relationships using color.
I thought these plots might show a relation between \(SO_2\) and quality but it turns out both high and low quality wines occur along the whole range. The outlier in total \(SO_2\) is actually a 7 in quality which is interesting.
From these two charts, it seems like higher quality wines tend to be slightly higher in alcohol and lower in volatile acidity. I looked into the alcohol a little further.
You can definitely see that wines in the 9-11 abv range get a lot of 5 ratings, while wines in the ~11.5-13 range get a higher percentage of 6’s and even quite a few 7’s. There are no 8’s below ~9.8 and no 3’s above ~11.
I tried some boxplots, splitting up the wines by quality rating, to see if these trends popped out any more.
Volatile acidity shows a downward trend with wine quality.
Alcohol content increases with wine quality.
Citric acid increases with wine quality.
Sulphates increase with wine quality.
These do a better job of showing some of the trends of various features with quality, and also show that increased sulphate content trends with higher quality as does citric acid level. This makes sense since sulphates help preserve the wine, and citric acid in small amounts is crisp and refreshing. It should be noted that the quality in the 4-7 range should be more strongly considered since there are relatively few points outside that range.
Density plots can be an interesting way to show the variation in a feature, and here I split by quality levels to show some of the trends discovered above.
Density plot of alcohol grouped by quality.
Density plot of fixed acidity grouped by quality.
Density plot of volatile acidity grouped by quality.
Density plot of citric acid grouped by quality.
Density plot of pH grouped by quality.
Density plot of residual sugar (log scale) grouped by quality.
These plots are interesting because you see not only from the location of the peaks where a feature is centered, but also how variable it is within a quality range. For instance, citric acid content in higher quality wines (7-8) is greater and somewhat less variable (at least for the 7’s). Overall these plots are somewhat tough to read, though, and don’t contain much more information than the boxplots.
I thought it would be interesting to take a slightly closer look at the acidity variables and how they are related, so I made some scatter plots.
It is interesting to me that fixed acidity, pH and citric acid have pretty clear relationships, but much less so for volatile acidity.
How did the feature(s) of interest vary with other features in the dataset? The surprising trend I saw was that quality scores tended to increase with alcohol content. The boxplots do a good job showing this and other trends, such as a decrease with volatile acidity or a slight increase with sulphates.
There were some interesting relationships among the acidity variables, for instance that volatile acidity did not really correlate with the other metrics.
Alcohol had the strongest correlation with quality, in the positive direction, and volatile acidity had the strongest in the negative direction.
I know that residual sugar and acidity are often things that must be in balance with alcohol content to make a good wine, so I tried a few scatterplots to see if I could look at multiple variables and see any trends. Residual sugar was skewed so I log transformed that variable. The built-in color gradient was tough for me to see so I switched to a rainbow color scale. There is lots of overplotting since the quality only takes integer values, so I am using geom_jitter and adding some transparency to the points.
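A minimal sketch of that plotting setup is below. It assumes ggplot2 is installed and uses only the first ten rows printed by `str()` above as a stand-in (`wines_sample`); the jitter height, alpha value, and number of rainbow colors are illustrative guesses, not the settings used for the plots in this report.

```r
library(ggplot2)  # assumed to be installed

# First ten rows as printed by str() above (illustrative stand-in for the full data)
wines_sample <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  alcohol        = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality        = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

p <- ggplot(wines_sample, aes(x = residual.sugar, y = quality, colour = alcohol)) +
  # Jitter only along the integer-valued quality axis; alpha adds transparency
  geom_jitter(width = 0, height = 0.2, alpha = 0.5) +
  # Log scale for the skewed sugar variable
  scale_x_log10() +
  # Rainbow gradient in place of the default colour scale
  scale_colour_gradientn(colours = rainbow(5))

print(p)
```

Setting `width = 0` keeps the continuous x values exact, so only the discrete quality scores are spread out.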
Scatterplot of quality vs. alcohol with points jittered for clarity.
Scatterplot of quality vs. sugar colored by alcohol. No trend.
Scatterplot of quality vs. fixed acidity colored by alcohol.
Scatterplot of quality vs. volatile acidity colored by alcohol.
Scatterplot of quality vs. pH colored by alcohol.
Scatterplot of quality vs. citric acid colored by alcohol.
Scatterplot of quality vs. citric acid colored by volatile acidity.
The fixed and volatile acidity plots may have had hints of trends (slight uptick in quality with fixed acidity but downturn with volatile acidity), but the coloration by alcohol did not seem too informative. I thought it might be easier to see patterns if I colored by quality since that is a discrete variable.
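To color by quality as a discrete variable, it first has to be treated as a factor rather than an integer. A small sketch using the first ten ratings printed by `str()` above; the level range 3 to 8 comes from the min and max in the summary, and `quality_f` is an illustrative name.

```r
# First ten quality ratings as printed by str() above
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Ordered factor covering the observed rating range (3 to 8 in the summary above)
quality_f <- factor(quality, levels = 3:8, ordered = TRUE)

print(table(quality_f))
```

Mapping a factor like this to color in ggplot2 gives one distinct color per rating instead of a continuous gradient.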
These plots are interesting because there does seem to be a trend for the higher quality wines to occupy the top right space of the graphs while the lower quality trend toward the bottom left areas (note that I had to invert volatile acidity and pH directions because higher pH is less acidic and volatile acidity actually follows the opposite trend).
Faceting these plots by quality is another way to look at the data, so I tried that below.
The trend that higher fixed acidity and alcohol correlate with higher quality can be pretty clearly seen by the plot above. The cluster of points moves up and right as the quality score increases.
This plot is sort of interesting. There is maybe a little too much going on, but it is showing that for higher quality wines, volatile acidity tends to be low (more orange/red points) and the citric acid and alcohol contents are high.
I also wanted to take a closer look at residual sugar, focusing on acidity and abv instead. Below are some scatterplots looking at sugar and its relation to some other variables like acidity, alcohol, chlorides (salt), and quality.
Scatterplot of alcohol vs. sugar colored by quality.
Scatterplot of citric acid vs. sugar colored by quality.
Scatterplot of volatile acidity vs. sugar colored by quality.
Boxplot of sugar content grouped by quality score.
Scatterplot of chlorides vs. sugar colored by alcohol.
Same plot as before only zoomed in and added some transparency.
I don’t get very much information from these plots about any relation between residual sugar and quality. The boxplot shows that for all the quality levels the variation of residual sugar is pretty comparable and there is no clear trend. The last graph does have some color separation, showing that for a given sugar level, higher salt and lower alcohol may be correlated. But this is not very convincing and I’m not really sure what that would mean anyway.
Were there features that strengthened each other in terms of looking at your feature(s) of interest? The scatterplots showed how most higher quality wines were either higher in alcohol, higher in acidity, or at medium levels of both. This illustrates that it is not just one aspect, but several, that contribute to a quality wine.
I was surprised that I couldn’t find any relation between sugar and quality or even between sugar and acidity really. I would think that it is very important in wine making to keep the sweetness and acid levels in balance.
I did not create any models.
The boxplot above illustrates a clear downward trend in volatile acidity with higher quality wine. At low quality scores of 3 and 4, there is a wider range for the middle 50% of volatile acidity values and the mean values are 0.88 and 0.69, respectively. For wines in the 5-6 range, the spread of volatile acidity values shrinks for the middle 50% of wines and the means drop first to 0.58, then 0.50. Finally, on the high end of quality, the spread stays small and the mean volatile acidity for wines of quality score 7 is only 0.40. Volatile acidity is measured in \(g/dm^3\).
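Per-quality means like the ones quoted above can be computed with `tapply`. The sketch below runs on only the first ten rows printed by `str()` (an illustrative stand-in called `wines_sample`), so its output will not reproduce the means quoted for the full dataset.

```r
# First ten rows as printed by str() above; too few to show the real trend
wines_sample <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Mean volatile acidity within each quality score
va_by_quality <- tapply(wines_sample$volatile.acidity, wines_sample$quality, mean)
print(round(va_by_quality, 2))
```

Swapping `mean` for `median` or `IQR` would give the center line and box height that the boxplot draws.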
This plot shows how higher quality wines tend to have higher levels of alcohol, acidity (but not volatile acidity, as discussed above), or both. The high quality wines (in blue) trend toward the upper right, while low quality (in red) occupies the bottom left. Note that the scale for pH has been flipped so that acidity increases from left to right. This plot illustrates that a wine with neither very much alcohol nor acidity will probably come off tasting flat or bland and not get a high quality score. Acid helps make a wine crisp and refreshing, while alcohol gives the wine some heat and a fuller body, so the two together can make some of the best wines. Though there are relatively few datapoints to go off of, it is worth noting that the majority of the wines that scored an 8 are below pH 3.5 and above 11% abv.
This figure depicts the trend of increasing fixed acidity and alcohol with higher quality score. Here I chose to double encode quality with color and faceting, which makes it easy to see the cluster of points move up and right as quality score increases. The mean abv for wines in the 5, 6, 7 range goes from 9.9 to 10.6 to 11.5. The mean fixed acidity for the same set increases from 8.17 to 8.34 to 8.87. This reinforces the conclusion that a higher quality wine of this type will typically be on the higher end of alcohol and acidity values.
This analysis looked at a dataset on Vinho Verde Red Wines containing almost 1600 wines. The data contained several metrics measuring each wine’s alcohol, acidity, sweetness, and other factors, as well as a quality score on a 1 to 10 scale. For this analysis, I thought it would be most interesting to try to find relationships between the features of the wine and the quality score. The first difficulty with this was that quality took on discrete, integer values only, and the majority were either 5 or 6, so this obscured some of the trends I was looking for. One way I overcame this was to group wines of similar quality together and look at the “average” wine (and range of wines) that received that score, for example with the boxplots or density plots. Another way was to use quality as the variable to color by, since this was the only discrete variable in the dataset (more of an ordered factor really). This showed some clear trends that informed later analysis. Another difficulty was with seeing the trends in multivariate scatterplots. I was able to decipher the relationships better when using graphic tools such as changing the color gradient or faceting by quality.
I think a further investigation of this dataset could yield interesting predictive models that could be used to estimate a wine’s quality score based on its chemical attributes. However, I think such a model would have a hard time being very accurate, because there is a good amount of variability among the data. A finer resolution quality scale (e.g. ratings from 1-100) might help make a better model and reduce error. Other data besides chemical features, such as where the grapes were from or the year, might be important for quality predictions as well, and it would be interesting to have the price data to see how well price correlates with quality. Still, just from this analysis, it seems that a Red Vinho Verde producer should be trying to make a wine on the higher end of alcohol or acidity (or medium levels of both) if they want a good quality score. Other factors, like having enough sulphates to preserve the wine and keeping volatile acidity low, are also helpful in making the highest quality wines.